Towards Using Web-Crawled Data for Domain Adaptation in Statistical Machine Translation

Authors

Pavel Pecina

Antonio Toral

Andy Way

Vassilis Papavassiliou

Prokopis Prokopidis

Maria Giagkou

Abstract

This paper reports on ongoing work on domain adaptation of statistical machine translation using domain-specific data obtained by domain-focused web crawling. We present a strategy for crawling monolingual and parallel data and for exploiting them for testing, language modelling, and system tuning in a phrase-based machine translation framework. The proposed approach is evaluated on two domains, Natural Environment and Labour Legislation, and two language pairs, English–French and English–Greek.

Similar resources

Domain Adaptation of Statistical Machine Translation using Web-Crawled Resources: A Case Study

In this research, we tackle the problem of domain adaptation of Statistical Machine Translation by exploiting domain-specific data acquired by domain-focused web-crawling. We design and empirically evaluate a procedure for automatic acquisition of both monolingual and parallel data and their exploitation for system training, tuning, and testing in a phrase-based Statistical Machine Translation f...

MT Adaptation for Under-Resourced Domains - What Works and What Not

In this paper, the authors present various techniques for achieving MT domain adaptation with limited in-domain resources. The paper gives a case study of what works and what does not when building a domain-specific machine translation system. Systems are adapted using in-domain comparable monolingual and bilingual corpora (crawled from the Web) and bilingual terms and named entities. The ...

Domain Adaptation for Medical Text Translation using Web Resources

This paper describes adapting statistical machine translation (SMT) systems to the medical domain using in-domain and general-domain data as well as web-crawled in-domain resources. In order to complement the limited in-domain corpora, we apply domain-focused web-crawling approaches to acquire in-domain monolingual data and a bilingual lexicon from the Internet. The collected data is used for adapting t...

Mining and Exploiting Domain-Specific Corpora in the

The objective of the PANACEA ICT-2007.2.2 EU project is to build a platform that automates the stages involved in the acquisition, production, updating and maintenance of the large language resources required by, among others, MT systems. The development of a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web is one of the most innovative building bloc...
